Today:

  • Understand the mean-variance relationship in count data GLMs
  • Fit, evaluate and compare different models for count data
  • Interpret Poisson GLM parameters
library(tidyverse)
library(ggbeeswarm)
library(janitor)
library(emmeans)
library(glmmTMB)
library(car)
library(DHARMa)
gambusia <- read_csv("Data/Gambusia.csv")
## Rows: 126 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Focal_ID, Treatment, Tagged, Trial_Date, Rival_ID, Female_ID
## dbl (8): Row_ID, Trial, Focal_Size, Time_Female.sec, No.Cop_Attempts, No.Cop...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Yesterday, we fitted the model:

summary(m_poisson <- glmmTMB(No.Cop_Success ~ 1 + Treatment,
                             data=gambusia, family = poisson))
##  Family: poisson  ( log )
## Formula:          No.Cop_Success ~ 1 + Treatment
## Data: gambusia
## 
##      AIC      BIC   logLik deviance df.resid 
##    624.9    630.6   -310.5    620.9      124 
## 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.52030    0.09713   5.357 8.47e-08 ***
## TreatmentWinner  0.16487    0.13204   1.249    0.212    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using simulations, we saw that this model was not a very good representation of the data. The model is not aware of all the variation in the data:

fsimul <- function(x) {
  # x is ignored: each call draws one fresh simulation from the fitted model
  tibble(simul = simulate(m_poisson)) %>% 
    tabyl(simul)
}

tibbsimul2 <- map_dfr(.x = 1:100, .f = fsimul)
ggplot(gambusia, aes(x=No.Cop_Success))+geom_bar() +
  geom_jitter(data = tibbsimul2, 
             mapping = aes(x=simul, y=n, col=as.factor(simul)), alpha=0.4)

What is a count data GLM?

library(DiagrammeR)

grViz("
digraph boxes_and_circles {
node [shape = square, color = DeepSkyBlue]
response; predictor

node [shape = diamond, color=ForestGreen]
intercept; effect

node [shape = circle, color = black]
'expected value';

'expected value' -> 'average count' [ label = '  exponential']; 
'average count' -> response [ label = '  rpois(lambda=average count)', style=dashed]; 
intercept -> 'expected value';
predictor -> effect  [arrowhead = none, label = ' \U00D7']; 
effect -> 'expected value' ;
}")

With a simple Poisson family, a single parameter links the average count to the response. That can be a problem because the average count controls the mean, the variance, and indeed every aspect of the distribution of the response.

Automated tests of fit: DHARMa

Running simulations like that to assess your models can be very useful, and is sometimes required to understand what is wrong, whether it matters, and what to do about it. Fortunately, for routine checks we can assess our models more directly with the package DHARMa.

library(DHARMa)

The function testResiduals() first simulates data from our model, then looks at the distances between model predictions and simulated data (these distances are “residuals”), and finally compares the simulated residuals to the residuals of the real data:

testResiduals(m_poisson)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.3233, p-value = 7.277e-12
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 3.8684, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa bootstrapped outlier test
## 
## data:  simulationOutput
## outliers at both margin(s) = 7, observations = 126, p-value < 2.2e-16
## alternative hypothesis: two.sided
##  percent confidence interval:
##  0.00000000 0.01210317
## sample estimates:
## outlier frequency (expected: 0.00325396825396825 ) 
##                                         0.05555556

Here we can diagnose at least three problems:

  • The model does not fit the data (KS tests)
  • The model misses a lot of variation (Dispersion test)
  • The model cannot predict extreme values (Outlier test)

This is exactly what we saw in our home-made simulations earlier. You can use either approach.

How can we build a better model? First we need to understand exactly what is wrong in the model structure.

The Poisson process

Our GLM assumes that the data are generated by a Poisson process. Let’s simulate that process to understand it.

dat <- tibble(x=rpois(n = 10000, lambda = 2.8))
ggplot(dat, aes(x=x))+ geom_bar()

mean(dat$x)
## [1] 2.7872
var(dat$x)
## [1] 2.802196

The Poisson distribution takes a single parameter, lambda (\(\lambda\)). That parameter fully controls the shape of the distribution, and equals both the expected mean and the expected variance. That is a key issue: in a Poisson distribution, the expected mean predicts the variance, leaving no room for any extra variation.
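We can check this mean-variance identity directly by simulating from several values of lambda (a quick base-R sketch; the lambda values are arbitrary):

```r
# For a Poisson distribution, the sample mean and variance should both be close to lambda
set.seed(42)
for (lambda in c(0.5, 2.8, 10)) {
  x <- rpois(n = 1e5, lambda = lambda)
  cat("lambda =", lambda,
      " mean =", round(mean(x), 2),
      " var =", round(var(x), 2), "\n")
}
```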

In biology, this lambda is unlikely to be constant. A Poisson GLM “thinks” the predictors in the model capture all the variation in lambda, which corresponds to this set of equations:

\[ \mathrm{obs} \sim Poisson(\lambda) \\ \lambda = \mathrm{exp}(y)\\ y = \alpha + \beta x \]

That is often not sufficient. The data would probably be better represented by

\[ \mathrm{obs} \sim Poisson(\lambda) \\ \lambda = \mathrm{exp}(y)\\ y = \alpha + \beta x + noise \] That is, there is unexplained variation left after accounting for the model predictors. A Poisson GLM cannot see that noise, which produces over-dispersion, and it measures uncertainty using only the variation it can see. Therefore, a Poisson GLM often under-estimates uncertainty.
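To see how unmodelled noise on the latent scale inflates the variance beyond the mean, here is a minimal simulation of the second set of equations (the intercept and noise standard deviation are arbitrary choices):

```r
# Poisson counts where lambda itself varies: y = alpha + noise, lambda = exp(y)
set.seed(123)
n   <- 1e5
y   <- 1 + rnorm(n, mean = 0, sd = 0.7)  # latent scale with unexplained noise
obs <- rpois(n, lambda = exp(y))
mean(obs)             # the average count
var(obs)              # much larger than the mean
var(obs) / mean(obs)  # dispersion ratio well above 1: over-dispersion
```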

If you forget about the problem of over-dispersion you will find too many false positive results. For instance, with a variance about 3 times larger than the mean, we find significant results 30% of the time when there is no true effect (ideally we expect 5%).

## Simulate data without any effect of x
pvalues <- vector(length = 100)
for (i in 1:100)
{
  simdat <- tibble(x = rnorm(50),
                   y = rnorm(50))  # latent noise, unrelated to x
  simdat$obs <- rpois(n = 50, lambda = exp(simdat$y))
  
  m0 <- glmmTMB(obs ~ x, data = simdat, family = poisson)
  pvalues[i] <- summary(m0)$coefficients$cond["x", "Pr(>|z|)"]
}
mean(pvalues < 0.05)
## [1] 0.31

How to solve the issue of over-dispersion?

Basically, we need to give some freedom to the mean-variance relationship. There are many ways of doing that; some will work better or worse for particular datasets. There is no silver bullet. You need to try and assess how good the model is.

A first option, called “nbinom1” (“Negative Binomial 1”), is an extension of the Poisson model in which the variance follows the equation \(V = m(1+\phi)\), where \(m\) is the mean and \(\phi\) is the over-dispersion parameter.
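Base R's rnbinom() uses the other common parameterisation, with variance \(m + m^2/\mathrm{size}\) (the “nbinom2” form); a variance of \(m(1+\phi)\) corresponds to setting size \(= m/\phi\). A quick simulation check of the nbinom1 variance (the values of mu and phi are arbitrary):

```r
# nbinom1-style variance mu*(1 + phi), obtained from rnbinom via size = mu/phi
set.seed(1)
mu  <- 3
phi <- 4  # over-dispersion: variance should be about mu * (1 + phi) = 15
x   <- rnbinom(n = 1e5, mu = mu, size = mu / phi)
mean(x)   # close to mu
var(x)    # close to mu * (1 + phi)
```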

summary(m_poisson_2 <- glmmTMB(No.Cop_Success ~ 1 + Treatment,
                             data=gambusia, family = nbinom1))
##  Family: nbinom1  ( log )
## Formula:          No.Cop_Success ~ 1 + Treatment
## Data: gambusia
## 
##      AIC      BIC   logLik deviance df.resid 
##    454.2    462.8   -224.1    448.2      123 
## 
## 
## Dispersion parameter for nbinom1 family (): 4.09 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       0.5218     0.1974   2.644  0.00819 **
## TreatmentWinner   0.1621     0.2405   0.674  0.50038   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s evaluate this model using DHARMa and then looking at simulated distributions:

testResiduals(m_poisson) # previous evaluation

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.3233, p-value = 7.277e-12
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 3.8684, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa bootstrapped outlier test
## 
## data:  simulationOutput
## outliers at both margin(s) = 7, observations = 126, p-value < 2.2e-16
## alternative hypothesis: two.sided
##  percent confidence interval:
##  0.00000000 0.01587302
## sample estimates:
## outlier frequency (expected: 0.00341269841269841 ) 
##                                         0.05555556
testResiduals(m_poisson_2) #much better

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.067347, p-value = 0.6171
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.76617, p-value = 0.496
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa bootstrapped outlier test
## 
## data:  simulationOutput
## outliers at both margin(s) = 0, observations = 126, p-value = 1
## alternative hypothesis: two.sided
##  percent confidence interval:
##  0.00000000 0.02380952
## sample estimates:
## outlier frequency (expected: 0.00611111111111111 ) 
##                                                  0
fsimul <- function(x) {
  # x is ignored: each call draws one fresh simulation from the fitted model
  tibble(simul = simulate(m_poisson)) %>% 
    tabyl(simul)
}

tibbsimul2 <- map_dfr(.x = 1:100, .f = fsimul)

ggplot(gambusia, aes(x=No.Cop_Success))+geom_bar() +
  geom_jitter(data = tibbsimul2, 
             mapping = aes(x=simul, y=n, col=as.factor(simul)),
             alpha=0.4, inherit.aes = FALSE) + ggtitle("Poisson GLM")

fsimul <- function(x) {
  tibble(simul = simulate(m_poisson_2)) %>% 
    tabyl(simul)
}

tibbsimul2 <- map_dfr(.x = 1:100, .f = fsimul)

ggplot(gambusia, aes(x=No.Cop_Success))+geom_bar() +
  geom_jitter(data = tibbsimul2, 
             mapping = aes(x=simul, y=n, col=as.factor(simul)),
             alpha=0.4, inherit.aes = FALSE) + xlim(-1,12)+ ggtitle("Nbinom1 GLM")
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 201 rows containing missing values (geom_point).

In this case the “Negative Binomial 1” seems appropriate, and much better than the strict Poisson family.

In general, avoid the plain Poisson family.

(unless you model the over-dispersion within the Poisson family, using random effects or a secondary linear predictor for dispersion; we will not cover this topic in the course. Just remember that the plain Poisson is risky, though not necessarily wrong if you know what you are doing!)

Other types of count data GLMs

Families

In glmmTMB there are 5 basic GLM families for count data:

summary(m_poisson <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = poisson))
##  Family: poisson  ( log )
## Formula:          No.Cop_Success ~ 1 + Treatment
## Data: gambusia
## 
##      AIC      BIC   logLik deviance df.resid 
##    624.9    630.6   -310.5    620.9      124 
## 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.52030    0.09713   5.357 8.47e-08 ***
## TreatmentWinner  0.16487    0.13204   1.249    0.212    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_genpoi <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = genpois)) 
##  Family: genpois  ( log )
## Formula:          No.Cop_Success ~ 1 + Treatment
## Data: gambusia
## 
##      AIC      BIC   logLik deviance df.resid 
##    458.7    467.2   -226.4    452.7      123 
## 
## 
## Dispersion parameter for genpois family (): 5.84 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       0.5193     0.2050   2.533   0.0113 *
## TreatmentWinner   0.1668     0.2389   0.698   0.4852  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_compois <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = compois))## Slow!
##  Family: compois  ( log )
## Formula:          No.Cop_Success ~ 1 + Treatment
## Data: gambusia
## 
##      AIC      BIC   logLik deviance df.resid 
##    469.0    477.5   -231.5    463.0      123 
## 
## 
## Dispersion parameter for compois family (): 1.74e+09 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       0.5203     0.1591   3.271  0.00107 **
## TreatmentWinner   0.1649     0.2218   0.743  0.45720   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_nbinom1 <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = nbinom1))
##  Family: nbinom1  ( log )
## Formula:          No.Cop_Success ~ 1 + Treatment
## Data: gambusia
## 
##      AIC      BIC   logLik deviance df.resid 
##    454.2    462.8   -224.1    448.2      123 
## 
## 
## Dispersion parameter for nbinom1 family (): 4.09 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       0.5218     0.1974   2.644  0.00819 **
## TreatmentWinner   0.1621     0.2405   0.674  0.50038   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_nbinom2 <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = nbinom2))
##  Family: nbinom2  ( log )
## Formula:          No.Cop_Success ~ 1 + Treatment
## Data: gambusia
## 
##      AIC      BIC   logLik deviance df.resid 
##    454.4    462.9   -224.2    448.4      123 
## 
## 
## Dispersion parameter for nbinom2 family (): 0.447 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       0.5203     0.2119   2.455   0.0141 *
## TreatmentWinner   0.1649     0.2973   0.555   0.5792  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All of them return (almost exactly) the same parameter estimates: intercept = 0.52 and TreatmentWinner = 0.16. However, they return different standard errors, z values and p values (the Poisson family being very different from the others), because they make different assumptions about how variation in the data works, and therefore about how sure we should be about the patterns we see.

Each family may be more or less appropriate for a given response variable; it depends on how variation relates to the expected mean of the response. Sometimes none of these families will provide a good description of the variation. In those cases, adding random effects, or using a family with zero-inflation or a linear predictor for dispersion, may be necessary (not covered in this course).
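A convenient first pass at choosing among families is to compare AIC (lower is better), in addition to the DHARMa checks; assuming the five models fitted above are still in your session:

```r
# Compare the five count-data families on the same data by AIC;
# here the Poisson is far worse, and nbinom1/nbinom2 are close
AIC(m_poisson, m_genpoi, m_compois, m_nbinom1, m_nbinom2)
```

AIC comparisons are only valid between models fitted to exactly the same response data, and a good AIC does not replace checking the residuals.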

Predictors

Sometimes a count data GLM does not fit well and shows signs of under/over-dispersion because important predictors are missing in the model.

For instance, let’s simulate count data “obs” that are a function of size and sex on the log scale:

set.seed(351188)
fakedata <- tibble(size=rnorm(1000), sex=sample(c(1,2), size=1000, replace = TRUE, prob = c(0.2, 0.8)), 
                   y=1 + 0.5*size + 1.6*sex + rnorm(n = 1000, 0, 0.02), obs=rpois(n = 1000, lambda = exp(y)))

We model obs as a function of size only:

summary(fake_nbinom1 <- glmmTMB(obs ~ 1 + size, data=fakedata, family = nbinom1)) 
##  Family: nbinom1  ( log )
## Formula:          obs ~ 1 + size
## Data: fakedata
## 
##      AIC      BIC   logLik deviance df.resid 
##   9595.1   9609.8  -4794.5   9589.1      997 
## 
## 
## Dispersion parameter for nbinom1 family (): 17.8 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.03847    0.01891  213.61   <2e-16 ***
## size         0.44591    0.01668   26.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.276, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.70967, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test with approximate
##  expectations
## 
## data:  simulationOutput
## outliers at both margin(s) = 10, observations = 1000, p-value = 0.472
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
##  0.004805511 0.018313243
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 ) 
##                                                   0.01

Here we have a case of under-dispersion. The model does not fit the data well and over-estimates the amount of noise in the data. In the presence of under-dispersion we expect fewer false positives than normal, but also more false negatives; that is, our statistical model has less power to detect effects than it should.

Sometimes you can still get an okay model by fitting a different GLM, but adding the missing predictor in the model solves the issue much better:

summary(fake_nbinom1_sex <- glmmTMB(obs ~ 1 + size + sex, data=fakedata, family = nbinom1)) 
##  Family: nbinom1  ( log )
## Formula:          obs ~ 1 + size + sex
## Data: fakedata
## 
##      AIC      BIC   logLik deviance df.resid 
##   6757.8   6777.4  -3374.9   6749.8      996 
## 
## 
## Dispersion parameter for nbinom1 family (): 0.0503 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 1.000227   0.036553   27.36   <2e-16 ***
## size        0.497551   0.004066  122.37   <2e-16 ***
## sex         1.597368   0.018613   85.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1_sex)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.012523, p-value = 0.9976
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.0193, p-value = 0.68
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test with approximate
##  expectations
## 
## data:  simulationOutput
## outliers at both margin(s) = 8, observations = 1000, p-value = 0.8594
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
##  0.003459976 0.015702049
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 ) 
##                                                  0.008
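To read the dispersion parameter in the output above: for glmmTMB's nbinom1 family the variance is mu * (1 + phi), where phi is the reported dispersion parameter. With phi = 0.0503, the variance is barely above what a Poisson model would assume, meaning almost no extra dispersion remains once size and sex are in the model (the mean below is an arbitrary example value).

```r
# nbinom1 mean-variance relationship: variance = mu * (1 + phi).
# phi = 0.0503 is taken from the fit above; mu = 5 is an arbitrary example.
phi <- 0.0503
mu  <- 5
mu * (1 + phi)  # about 5.25, barely above the Poisson variance of 5
```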

With the same data, if we fit sex only but not size, we get over-dispersion:

summary(fake_nbinom1_sex_only <- glmmTMB(obs ~ 1 + sex, data=fakedata, family = nbinom1)) 
##  Family: nbinom1  ( log )
## Formula:          obs ~ 1 + sex
## Data: fakedata
## 
##      AIC      BIC   logLik deviance df.resid 
##   9429.3   9444.1  -4711.7   9423.3      997 
## 
## 
## Dispersion parameter for nbinom1 family (): 15.4 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.67100    0.10798   15.47   <2e-16 ***
## sex          1.32142    0.05522   23.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1_sex_only)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.052034, p-value = 0.008897
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.3403, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test with approximate
##  expectations
## 
## data:  simulationOutput
## outliers at both margin(s) = 11, observations = 1000, p-value = 0.2807
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
##  0.005503594 0.019596628
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 ) 
##                                                  0.011
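The same mechanism can be sketched in base R with made-up data (all names here are hypothetical): when counts depend on a covariate we leave out of the model, the unexplained variation shows up as over-dispersion, here measured with the Pearson dispersion statistic (sum of squared Pearson residuals over the residual degrees of freedom, about 1 for a well-specified Poisson model).

```r
# Hypothetical sketch: an omitted covariate creates over-dispersion.
# Counts depend on 'size'; leaving 'size' out of the model inflates the
# Pearson dispersion statistic well above 1.
set.seed(1)
size   <- rnorm(2000)
counts <- rpois(2000, lambda = exp(1 + 0.5 * size))

pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

pearson_dispersion(glm(counts ~ 1, family = poisson))     # well above 1
pearson_dispersion(glm(counts ~ size, family = poisson))  # close to 1
```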

The issue disappears when we include size in the model:

summary(fake_nbinom1_sex <- glmmTMB(obs ~ 1 + size + sex, data=fakedata, family = nbinom1)) 
##  Family: nbinom1  ( log )
## Formula:          obs ~ 1 + size + sex
## Data: fakedata
## 
##      AIC      BIC   logLik deviance df.resid 
##   6757.8   6777.4  -3374.9   6749.8      996 
## 
## 
## Dispersion parameter for nbinom1 family (): 0.0503 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 1.000227   0.036553   27.36   <2e-16 ***
## size        0.497551   0.004066  122.37   <2e-16 ***
## sex         1.597368   0.018613   85.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1_sex)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.012523, p-value = 0.9976
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.0193, p-value = 0.68
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test with approximate
##  expectations
## 
## data:  simulationOutput
## outliers at both margin(s) = 8, observations = 1000, p-value = 0.8594
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
##  0.003459976 0.015702049
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 ) 
##                                                  0.008

Time for practice!